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Abstract 

We develop a novel framework to study smooth and strongly convex optimization algo¬ 
rithms, both deterministic and stochastic. Focusing on quadratic functions we are able 
to examine optimization algorithms as a recursive application of linear operators. This, 
in turn, reveals a powerful connection between a class of optimization algorithms and the 
analytic theory of polynomials whereby new lower and upper bounds are derived. Whereas 
existing lower bounds for this setting are only valid when the dimensionality scales with the 
number of iterations, our lower bound holds in the natural regime where the dimensional¬ 
ity is fixed. Lastly, expressing it as an optimal solution for the corresponding optimization 
problem over polynomials, as formulated by our framework, we present a novel systematic 
derivation of Nesterov’s well-known Accelerated Gradient Descent method. This rather 
natural interpretation of AGD contrasts with earlier ones which lacked a simple, yet solid, 
motivation. 

Keywords: Smooth and Strongly Convex Optimization, Full Gradient Descent, Acceler¬ 
ated Gradient Descent, Heavy Ball method 

1. Introduction 

In the field of mathematical optimization one is interested in efficiently solving a minimiza¬ 
tion problem of the form 


min/(x) (1) 

xex 

where the objective function f is some real-valued function defined over the constraints 
set X. Many core problems in the field of Computer Science, Economic, and Operations 
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Research can be readily expressed in this form, rendering this minimization problem far- 
reaching. That being said, in its full generality this problem is just too hard to solve or 
even to approximate. As a consequence, various structural assumptions on the objective 
function and the constraints set, along with better-suited optimization algorithms, have 
been proposed so as to make this problem viable. 


One such case is smooth and strongly convex functions over some d-dimensional Eu¬ 
clidean spac^ Precisely, we consider continuously differentiable / : M which are 

L-smooth, i.e., 

||V/(x)-V/(y)|| <L||x-y||, Vx,y 

and ^-strongly convex, that is, 

/(y) > /(x) + (y - X, V/(x)) + I ||y - xf , Vx, y E R-" 

A wide range of applications together with efficient solvers have made this family of prob¬ 
lems very important. Naturally, an interesting question arises: how fast can these kind 
of problems be solved? better said, what is the computational complexity of minimizing 
smooth and strongly-convex functions to a given degree of accuracy 10 Prior to answering 
these, otherwise ill-defined, questions, one must first address the exact nature of the under¬ 
lying computational model. 


Although being a widely accepted computational model in the theoretical computer 
sciences, the Turing Machine Model presents many obstacles when analyzing optimization 
algorithms. In their seminal work, Nemirovsky and Yudin (1983) evaded some of these 
difficulties by proposing the black box computational model, according to which informa¬ 
tion regarding the objective function is acquired iteratively by querying an oracle. This 
model does not impose any computational resource constraint^ Nemirovsky and Yudin 
showed that for any optimization algorithm which employs a first-order oracle, i.e. receives 
(/(x), V/(x)) upon querying at a point x E there exists an L-smooth /r-strongly convex 
function / : —>■ M, such that for any e > 0 the number of oracle calls needed for obtaining 

an e-optimal solution x, i.e., 


/(x) < min /(x) -b e 
xSR'* 


( 2 ) 


must satisfy 


# Oracle Calls > Q (min {d,-v/Kln(l/e}) 


(3) 


1. More generally, one may consider smooth and strongly convex functions over some Hilbert space. 

2. Natural as these questions might look today, matters were quite different only few decades ago. In his 
book ’Introduction to Optimization’ which dates back to 87’, Polyak B.T devotes a whole section as to: 


’Why Are Convergence Theorems Necessary?’ (See section 1.6.2 in Polyak (1987l). 

3. In a sense, this model is dual to the Turing Machine model where all the information regarding the 
parameters of the problem is available prior to the execution of the algorithm, but the computational 
resources are limited in time and space. 
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where k = L/fj, denotes the so-called condition number. 


The result of Nemirovsky and Yudin can be seen as the starting point of the present 
paper. The restricted validity of this lower bound to the first 0{d) iterations is not a 
mere artifact of the analysis. Indeed, from an information point of view, a minimizer of 
any convex quadratic function can be found using no more than 0{d) first-order queries. 
Noticing that this bound is attained by the Conjugate Gradient Descent method (CGD, see 


Polyak (1987)), it seems that one cannot get a non-trivial lower bound once the number of 
queries exceeds the dimension d. Moreover, a similar situation can be shown to occur for 
more general classes of convex functions. However, the known algorithms which attain such 
behavior (such as GGD and the center-of-gravity method, e.g., Nemirovski ( 2005| )) require 
computationally intensive iterations, and are quite different than many common algorithms 
used for large-scale optimization problems, such as gradient descent and its variants. Thus, 
to capture the attainable performance of such algorithms, we must make additional assump¬ 
tions on their structure. This can be made more solid using the following simple observation. 


When applied on quadratic functions, the update rule of many optimization algorithms 
reduees to a recursive applieation of a linear transformation which depends, possibly ran¬ 
domly, on the previous p query points. 


Indeed, the update rule of GGD for quadratic functions is non-stationary, i.e. uses a dif¬ 
ferent linear transformation at each iteration, as opposed to other optimization algorithms 
which utilize less complex update rules such as: stationary updates rule, e.g.. Gradient De¬ 


scent, Accelerated Gradient Descent, Newton’s method (see Nesterov (2004)), The Heavy 


SDGA (see Shalev-Shwartz and Zhang (2013b)) and SAG (see 


Roux et al. (|2012[)); cyclic update rules, e.g,. SVRG (see Johnson and Zhang (2013)); 


and piecewise-stationary update rules, e.g., proximal methods and Accelerated SDGA (see 


Shalev-Shwartz and Zhang (2013a)). Inspired by this observation, in the present work we 
explore the boundaries of optimization algorithms which admit stationary update rules. We 
call such algorithms p-Stationary Canonical Linear Iterative optimization algorithms (abbr. 
p-SCLI), where p designates the number of previous points which are necessary to generate 
new points. The quantity p can be instructively interpreted as a limit on the amount of 
memory at the algorithm’s disposal. 


Similar to the analysis of power iteration methods, the convergence properties of such 
algorithms are intimately related to the eigenvalues of the corresponding linear transfor¬ 
mation. Specifically, as the convergence rate of the recursive application of a linear trans¬ 
formation is essentially characterized by its largest magnitude eigenvalue, the asymptotic 
convergence rate of p-SCLI algorithms can be bounded from above and from below by an¬ 
alyzing the spectrum of the corresponding linear transformation. At this point we would 
like to remark that the technique of linearizing iterative procedures and analyzing their 
convergence behavior accordingly, which dates back to the pioneering work of the Rus¬ 
sian mathematician Lyapunov, has been successfully applied in the field of mathematical 
optimization many times, e.g., Polyak ( 1987| and more recently Lessard et al. (2014). How¬ 
ever, whereas previous works were primarily concerned with deriving upper bounds on the 


3 























































magnitude of the corresponding eigenvalues, in this work our reference point is lower bounds. 


As eigenvalues are merely roots of characteristic polynomial^ our approach involves 
establishing a lower bound on the maximal modulus (absolute value) of the roots of poly¬ 
nomials. Clearly, in order to find a meaningful lower bound, one must first find a condition 
which is satisfied by all characteristic polynomials that correspond to p-SCLIs. We show 
that such condition does exist by proving that characteristic polynomials of consistent p- 
SCLIs, which correctly minimize the function at hand, must have a specihc evaluation at 
A = 1. This in turn allows us to analyze the convergence rate purely in terms of the analytic 
theory of polynomials, i.e., 


Find min{p(q( 2 ;)) | q{z) is a real monic polynomial of degree p and g(l) = r} (4) 


where r G M and p{q{z)) denotes the maximum modulus over all roots of q{z). Although a 
vast range of techniques have been developed for bounding the moduli of roots of polynomi¬ 
als (e.g. Marden (1966); Rahman and Schmeisser (2002); Milovanovic et al. (1994); Walsh] 


( |1922 ); Milovanovic and Rassias (2000); Fell (1980 


)), to the best of our knowledge, few of 


them address lower bounds (see Higham and Tisseur (2003). The minimization problem (Q 
is also strongly connected with the question of bounding the spectral radius of ’generalized’ 
companion matrices from below. Unfortunately, this topic too lacks an adequate coverage 


in the literature (see Wolkowicz and Styan (1980); Zhong and Huang (2008); Horne (1997); 
Huang and Wang (2007)). Consequently, we devote part of this work to establish new 
tools for tackling (Q. It is noteworthy that these tools are developed by using elementary 
arguments. This sharply contrasts with previously proof techniques used for deriving lower 
bounds on the convergence rate of optimization algorithms which employed heavy machin¬ 


ery from the field of extremal polynomials, such as Chebyshev polynomials (e.g.. Mason 


and Handscomb (2002)). 


Based on the technique described above we present a novel lower bound on the conver¬ 
gence rate of p-SCLI optimization algorithms. More formally, we prove that any p-SCLI 
optimization algorithm over M'’*, whose iterations can be executed efficiently, requires 

^Oracle Calls > 17 (-0cln(l/e)) (5) 

in order to obtain an e-optimal solution, regardless of the dimension of the problem. This 
result partially complements the lower bound presented earlier in Inequality ([^. More 
specihcally, for p = 1, we show that the runtime of algorithms whose update rules do not 
depend on previous points (e.g. Gradient Descent) and can be computed efficiently scales 
linearly with the condition number. For p = 2, we get the optimal result for smooth and 
strongly convex functions. For p > 2, this lower bound is clearly weaker than the lower 
bound shown in (§ at the first d iterations. However, we show that it can be indeed at¬ 
tained by p-SCLI schemes, and surprisingly, some of them can be executed efficiently for 
certain classes of quadratic functions. Finally, we believe that a more refined analysis of 
problem (Q would show that this technique is powerful enough to meet the classical lower 

4. In fact, we will use a polynomial matrix analogous of characteristic polynomials which will turns out to 
be more useful for our purposes. 
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bound ^/k for any p, in the worst-case over all quadratic problems. 


The last part of this work concerns a cornerstone in the field of mathematical optimiza¬ 
tion, i.e., Nesterov’s well-known Accelerated Gradient Descent method (AGD). At the time 
the work of Nemirovsky and Yudin was published, it was known that Gradient Descent 
(GD) obtains an e-optimal solution by issuing no more than 


C>(Kln(l/e)) 

first-order queries. The gap between this upper bound and the lower bound shown in (§ 
has intrigued many researchers in the field. Eventually, it was this line of inquiry that led 
to the discovery of AGD by Nesterov (see Nesterov ( 1983[ )), a slight modification of the 
standard GD algorithm, whose iteration complexity is 


0{y/K\n{l/e)) 


Unfortunately, AGD lacks the strong geometrical intuition which accompanies many op¬ 
timization algorithms, such as FGD and the Heavy Ball method. Primarily based on so¬ 
phisticated algebraic manipulations, its proof strives for a more intuitive derivation (e.g. 


Beck and Teboulle| (2009 

); Baes 

(2009 

); Tseng 

(2008) 

Sutskever et al. 

(2013) 

Allen-Zhu 

and Orecchia 

(2014)). This downside has rendered the generalization of AGD to different 


optimization scenarios, such as constrained optimization problems, a highly non-trivial task 
which up to the present time does not admit a complete satisfactory solution. Surprisingly 
enough, by designing optimization algorithms whose characteristic polynomials are optimal 
with respect to a constrained version of Q, we have uncovered a novel simple derivation of 
AGD. This reformulation as an optimal solution for a constrained optimization problem over 
polynomials, shows that AGD and the Heavy Ball are essentially two sides of the same coin. 


To summarize, our main contributions, in order of appearance, are the following: 

• We define a class of algorithms (p-SGLI) in terms of linear operations on the last p 
iterations, and show that they subsume some of the most interesting algorithms used 
in practice. 

• We prove that any p-SCLI optimization algorithm must use at least 

D (-^ln(l/e)) 

iterations in order to obtain an e-optimal solution. As mentioned earlier, unlike ex¬ 
isting lower bounds, our bound holds for every fixed dimensionality. 

• We show that there exist matching p-SCLI optimization algorithms which attain the 
convergence rates stated above for all p. Alas, for p > 3, an expensive pre-calculation 
task renders these algorithms inefficient. 

• As a result, we focus on a restricted subclass of p-SCLI optimization algorithms which 
can be executed efficiently. This yields a novel systematic derivation of Full Gradi¬ 
ent Descent, Accelerated Gradient Descent, The Heavy-Ball method (and potentially 
others efficient optimization algorithms), each of which corresponds to an optimal 
solution of optimization problems on the moduli of polynomials’ roots. 


5 
















































• We present new schemes which offer better utilization of second-order information 
by exploiting breaches in existing lower bounds. This leads to a new optimization 
algorithm which obtains a rate of in the presence of large enough spectral 

gaps. 


1.1 Notation 

We denote scalars with lower case letters and vectors with bold face letters. We use 
to denote the set of all positive real numbers. All functions in this paper are defined over 
Euclidean spaces equipped with the standard Euclidean norm and all matrix-norms are 
assumed to denote the spectral norm. 


We denote a block-diagonal matrix whose blocks are Ai,...,Afc by the conventional 
direct sum symbol, i.e., ®\=iAk. We devote a special operator symbol for scalar matrices 
Diag (ai,..., Urf) = The spectrum of a square matrix A and its spectral radius, 

the maximum magnitude over its eigenvalues, are denoted by cr{A) and p{A), respectively. 
Recall that the eigenvalues of a square matrix A G are exactly the roots of the 

characteristic polynomial which is defined as follows 

Xa(A) = det(A - Xld) 

where Id denotes the identity matrix. Since polynomials in this paper have their origins as 
characteristic polynomials of some square matrices, by a slight abuse of notation, we will 
denote the roots of a polynomial q{z) and its root radius, the maximum modulus over its 
roots, by a{q{z)) and p{q{z)), respectively, as well. 


The following notation for quadratic functions and matrices will be of frequent use, 

= |a G A is symmetric and a(A) C s| 


Q^(S) ^ {/A,b(x) I A G ,b G R^] 


where S denotes a non-empty set of positive reals, and where /A,b(x) denotes the following 
quadratic function 


/A,b(x) =x^Ax + h^x 


2. Framework 


In the sequel we establish our framework for analyzing optimization algorithms for mini¬ 
mizing smooth and strongly convex functions. First, to motivate this technique, we show 


that the analysis of SDCA presented in Shalev-Shwartz and Zhang (2013b) is tight by us¬ 
ing a similar method. Next, we lay the foundations of the framework by generalizing and 
formalizing various aspects of the SDCA case. We then examine some popular optimiza¬ 
tion algorithms through this formulation. Apart from setting the boundaries for this work, 
this inspection gives rise to, otherwise subtle, distinctions between different optimization 
algorithms. Lastly, we discuss the computational complexity of p-SCLIs, as well as their 
convergence properties. 
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2.1 Case Study - Stochastic Dual Coordinate Ascent 

We consider an optimization algorithm called Stochastic Dual Coordinates Ascent (SDCi^ 
for solving Regularized Loss Minimization (RLM) problems Q, which are of great signifi¬ 
cance for the field of Machine Learning. It is shown that applying SDCA on quadratic loss 
functions allows one to reformulate it as a recursive application of linear transformations. 
The relative simplicity of such processes is then exploited to derive a lower bound on the 
convergence rate. 


A smooth-RLM problem is an optimization task of the following form 

1 "" A 

min P(w) = - (pii'w'^Xi) -h - ||w||^ ( 6 ) 

wsM'* n 2 

1=1 

where (j)i are 1 / 7 -smooth and convex, xi,..., are vectors in and A is a positive con¬ 
stant. For ease of presentation, we further assume that are non-negative, <^i(0) < 1 and 
||xj|| < 1 for all i. 


The optimization algorithm SDCA works by minimizing an equivalent optimization 
problem 


min D{ol) = + 


1 


aeM' 


n 


i=l 


2An2 


E 

2 = 1 


Oii^i 


where (p* denotes the Fenchel conjugate of 4>, by repeatedly picking 2 ; ~ U{[n]) uniformly 
and minimizing D{ol) over the z’th coordinate. The latter optimization problem is referred 
to as the dual problem, while the problem presented in Q is called the primal problem. 
As shown in Shalev-Shwartz and Zhang (2013b), it is possible to convert a high quality 
solution of the dual problem into a high quality solution of the primal problem. This allows 
one to bound from above the number of iterations required for obtaining a prescribed level 
of accuracy e > 0 by 

‘A 


0{ {n + —j ln(l/e) 


Let us show that this analysis is indeed tight. First, let us define the following 2-smooth 
functions 

= i = l,...,n 

and let us define xi = X 2 = • • • = x,^ = ;^ 1. This yields 


1 


1 


D(a) = -Q ' —I + ^11 ' a 
^ ^ ' 2re An2 ' 


( 7 ) 


5. For a detailed analysis of SDCA, please refer to Shalev-Shwartz and Zhang (2013b I. 
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Clearly, the unique minimizer of D{cx.) is ct* = 0. Now, given i G [n] and ot G M"' , it is easy 
to verify that 


-2 


argmin ,..., cx^ — \ , cx , ,..., ^ ^ Oij 

a'eK 2 + An ^ 


( 8 ) 


Thus, the next test point a~^, generated by taking a step along the i’th coordinate, is linear 
transformation of the previous point, i.e.. 


0:“'“ = ( / — eju7 ) a 


(9) 


Where 


T A 

U, = 


1 


2 + An ’ ’ 2 + An ’ ’ 2 + An ’ ’ 2 + An 

i’s entry 


Let Q^, k = 1,... ,K denote the fc’th test point. The sequence of points is ran¬ 

domly generated by minimizing D{ot) over the Zj’th coordinate at the Lth iteration, where 
zi, Z 2 , ■ ■ ■, zk ~ ^(N) is a sequence of K uniform distributed i.i.d random variables. Ap¬ 
plying Q over and over again starting from some initialization point we obtain 




I ^zk-\^zk-\ 


■■{I - ] oP 


To compute E[q:^] note that by the i.i.d hypothesis and by the linearity of the expectation 
operator. 


E [q^] = E 
= E 
= E 


/ - (/ - 


I-e^iuJJ q:° 


I - 


'^K ^ZK 


E 


I - , u 


zk-1 ^zk-1 


• ••E 




a 


I - 


K 


a 


( 10 ) 


The convergence rate of this process is governed by the spectral radius of 


E = E 


I - e^uj 


A straightforward calculation shows that the eigenvalues of E, ordered by magnitude, are 

1 


1 ^ 2 -|- A 

2/A-l-n’ ’2/A-|-n’ 2-1-An 


( 11 ) 


n—1 times 


By choosing to be the following normalized eigenvector which corresponds to the largest 
eigenvalue. 


1 


1 























and plugging it into Equation (10), we can now bound from below the distance of E[q:'^] 
to the optimal point o:* = 0, 


|E [ol^] - 


Q: = 


E 

= 11 - 


I - e^uj 


2/A + n 


K 


K 


OL 


a 


= 1 - 


(4/A + 2n- 1) + 1 

K 


K 


> exp » 


-1 


^2/A + n- 1 

Where the last inequality is due to the following inequality, 


1 -> exp 

X + 1 


-2 
X — 1 


Vx > 1 


( 12 ) 


We see that the minimal number of iterations required for obtaining a solution whose 
distance form the a* is less than e > 0 must be greater than 

(2/A + n — 1) In (1/e) 

Thus showing that, up to logarithmic factors, the analysis of the convergence rate of SDCA 
is tight. 


2.2 Definitions 

In the sequel we introduce the framework of p-SCLI optimization algorithms which gener¬ 
alizes the analysis shown in the preceding section. 

We denote the set oi d x d symmetric matrices whose spectrum lies in S C M++ by 
5‘^(S) and denote the following set of quadratic functions 

/A,b(x) = + b’^x, A e 5‘^(S) 

by Q'^(S). Note that since twice continuous differentiable functions /(x) are L-smooth and 
/r-strongly convex if and only if 

cj(v2(/(x))) C [/i,L] C M++, xEM^ 

we have that Q'^([/i,L]) comprises L-smooth /r-strongly convex quadratic functions. Thus, 
any optimization algorithm designed for minimizing smooth and strongly convex functions 
can be used to minimize functions in Q'^([/r, L]). The key observation here is that since the 
gradient of /y 4 ,b(x) is linear in x, when applied to quadratic functions, the update rules of 
many optimization algorithms also become linear in x. This formalizes as follows. 
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Definition 1 (p-SCLI optimization algorithms) An optimization algorithm A is called 
a p-stationary canonical linear iterative (abbr. p-SCLI) optimization algorithm overW^ if 
there exist p + 1 mappings Co{X),Ci{X),... ,Cp-i{X), N{X) from to -valued 

random variables, sueh that for any /A,b(x) G Q'^(S) the corresponding initialization and 
update rules take the following form: 

p-i 

^ Cj{A)x’^-P+^ + N{A)h, k = p,p+l, 

j=0 

We further assume that in each iteration Cj{A) and N(A) are drawn independently of 
previous realization^ and thatKCi{A) are finite and simultaneously triangularizabl^ 

Let us introduce a few more definitions and terminology which will be used throughout 
this paper. The number of previous points p by which new points are generated is called the 
lifting faetor. The matrix-valued random variables Co{X),Ci{X),..., Cp-i{X) and N{X) 
are called coeffieient matriees and inversion matrix, respectively. The term inversion matrix 
refers to the mapping N{X), as well as to a concrete evaluation of it. It will be clear from 
the context which interpretation is being used. The same comment holds for coefficient 
matrices. 

As demonstrated by the following definition, coefficients matrices of p-SCLIs can be 
equivalently described in terms of polynomial matrice^ This correspondence will soon 
play a pivotal role in the analysis of p-SCLIs. 

Definition 2 The characteristie polynomial of a given p-SCLI optimization algorithm A is 
defined by 


(13) 

(14) 


p-i 

£a(A,X) 4 - Y,^Cj{X)X (15) 

j=o 

where Cj{X) denote the coeffieient matriees. Moreover, given X G we define the root 

radius of C_a{X,X) by 

px{C{X, X)) = p{det C{X, X)) = max{|A'| | det£(A',X) = O} 

For the sake of brevity, we will sometimes specify a given p-SCLI optimization algorithm A 
using an ordered pair of a characteristic polynomial and an inversion matrix as follows 


A = {Ca{\X),N{X)) 


6. In this context, this assumption is usually referred to as stationarity. 

7. Intuitively, having this technical requirement is somewhat similar to assuming that the coefficients ma¬ 


trices commute (see Drazin et al. (19511 for a precise statement), and as such does not seem to restrict 


the scope of this work. Indeed, it is common to have 'ECfiA) as polynomials in A or as diagonal matrices, 
in which case the assumption holds true. 


8. For a detailed cover of polynomial matrices see Gohberg et al. (20091. 
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Lastly, note that nowhere in the definition of p-SCLIs did we assume that the optimiza¬ 
tion process converges to the minimizer of the function under consideration - an assumption 
which we refer to as consistency. 


Definition 3 (Consistency of p-SCLI optimization algorithms) A p-SCLI optimiza¬ 
tion algorithm A is said to be consistent with respect to a given A G if for any b G W^, 

A converges to the minimizer of f a, h{^)> regardless of the initialization point. That is, for 
defined in we have that 


X 


^ -A-^h 

for any b G Furthermore, if A is consistent with respect to all A G then we say 

that A is consistent with respect to Q^(S). 

2.3 Specifications for Some Popular optimization algorithms 

Having defined the framework of p-SCLI optimization algorithms, a natural question now 
arises: how broad is the scope of this framework and what does characterize optimization 
algorithms which it applies to? Loosely speaking, any optimization algorithm whose update 
rules depend linearly on the first and the second order derivatives of the function under con¬ 
sideration is eligible for this framework. Instead of providing a precise characterization for 
such algorithms, we apply various popular optimization algorithms on a general quadratic 
function fA,h{^) £ -^]) then re-express them as p-SCLI optimization algorithms. 


Full Gradient Descent (FGD) is a 1-SCLI optimization algorithm, 
x° G 

^k+i _ ^v/(x^) = - /3(Hx^ + b) = (/ - - /3b 


See Nesterov (2004) for more details. 


Newton method is a 0-SCLI optimization algorithm. 


x° G 
+ l _ ^/c 


xfc+i = - (VV(x^))“^V/(x'') = x'^ - A-^{Ax^ + b) 


= (/ - H-M)x'' - A-^h = -A-^h 


-ii 


i-lv 


Note that Newton method can be also formulated as a degenerate p-SCLI for some 
p G N, whose coefficients matrices vanish. See Nesterov (2004) for more details. 
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The Heavy Ball Method is a 2-SCLI optimization algorithm. 


Jc+l _ 




= - aV/(x'') + ± 

= x^ - a(^x*^ + b) + ;0(x^ - x^-i) 

= ((1 + /3)/ - aA) x^ - /3/x*^-i - ab 


a = 


T + 


2 ’ 


/3 = 


-v/L + 


See Polyak (1987) for more details. 


Accelerated Gradient Descent (AGD) is a 2-SCLI optimization algorithm. 

x° = y° e 

^+1 = x^ - ^V/(x^) 

:^+i = (l+a)y^+i_ay'= 

y/Z-^ 



y 


X- 

Which can be rewritten 

as, 

x° G 


= (1 J- a) 


= (1 J- a) 


= (1 -ba) 



a = 


“ 1 “ \/Z 


L 


L 


k-l 


L 


k-l 


k-V 


k-l 


L 


+ b) 


L 


L 


.k-l 


L 


b 


See Nesterov (2004) for more details. 


Stochastic Coordinate Descent (SCD) is a 1-CLI optimization algorithm. This is an 
extension of the example shown in Section 2.1 SCD acts by repeatedly minimizing a 
uniformly randomly drawn coordinate in each iteration. That is, 


x^G 


Pick i ~ ^([d]) and set = ( / — —^eja^'^. j x'' — 


Ai i 


hi 


Ai i 


where a^^ denotes the Pth row of A and b = {bi,b 2 , ■ ■ ■ ,bfi)- Note that the expected 
update rule of this method is equivalent to the well-known Jacobi’s iterative method. 

We now describe some popular optimization algorithms which do not fit this framework, 
mainly because the stationarity requirement fails to hold. The extension of this framework 
to cyclic and piecewise stationary optimization algorithms is left to future work. 
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Conjugate Gradient Descent (CGD) Can be expressed as a non-stationary linear it¬ 
erative method. 


X 


k-\-l 


= ((1 + Pk)I - akA) x" - 


where and f3k are computed at each iteration based on x* 


X' 


fc-i 


, A and b. Note the 


similarity of CGD and the heavy ball method. See Polyak (1987); Nemirovski (2005) 
for more details. 


Stochastic Gradient Descent (SGD) A straightforward extension of the deterministic 
FGD. Specihcally, let {Q,,T,V) be a probability space and let G(x,a;) : x D — >■ 

be an unbiased estimator of V/(x) for any x. That is, 

E[G(x, w)] = V/(x) = Ax + b, X G 

Equivalently, dehne e(x, w) = G(x,a;) — (Ax + b) and assume E[e(x, cj)] = 0, x G 
SGD may be defined using a suitable sequence of step sizes as follows 

Generate randomly and set = x^ — 7jG(x^, Wfc) 

= (/ - 7jA) x^ - 7ib - 7ie(x, to) 

Clearly, some types of noise may not form a p-SCLI optimization algorithm. However, 
for some instances, e.g., quadratic learning problems, we have 

e(x,a;) = A^x + b(^ 


such that 


E[A^] = 0, E[b^] = 0 


If, in addition, the step size is fixed then we get a 1-SCLI optimization algorithm. See 


Kushner and Yin (2003); Spall (2005); Nemirovski (2005) for more details. 


2.4 Computational Complexity 

The stationarity property of general p-SCLIs optimization algorithms implies that the com¬ 
putational cost of minimizing a given quadratic function /yi,b(x), assuming 0 (1) cost for 
all arithmetic operations, is 


^ Iterations x < 


Generating coefficient and inversion matrices randomly 
-7 

Executing update rule ([M]) based on the previous p points 


The computational cost of the execution of update rule scales linearly with d the di¬ 
mension of the problem and p the lifting factor. Thus, the running time of p-SCLIs is mainly 
affected by the iterations number and the computational cost of randomly generating co¬ 
efficient and inversion matrices each time. Notice that for deterministic p-SCLIs one can 
save running time by computing the coefficient and inversion matrices once, prior to the 
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execution of the algorithm. Not surprisingly, but interesting nonetheless, there is a law of 
conservation which governs the total amount of computational cost invested in both factors: 
the more demanding is the task of randomly generating coefficient and inversion matrices, 
the less is the total number of iterations required for obtaining a given level of accuracy, 
and vice verse. Before we can make this statement more rigorous, we need to present a few 
more facts about p-SCLIs. For the time being, let us focus on the iteration complexity, i.e., 
the total number iterations, which forms our analogy for black box complexity. 


The iteration complexity of a p-SCLI optimization algorithm A with respect to an ac¬ 
curacy level e, an initialization points and a quadratic function symbolized 

by 

XC^(e,/Tb(x),T0) 

is defined to be the minimal number of iterations K such that 

E[x^ - X*] < e, yk>K 


where x* = —A is the minimizer of /yi^b(x), assuming A is initialized at We would 
like to point out that although iteration complexity is usually measured through 


E 


x^ -x^* 


here we employ a different definition. We will discuss this issue shortly. 


In addition to showing that the iteration complexity of p-SCLI algorithms scales log¬ 
arithmically with 1/e, the following theorem provides a characterization for the iteration 
complexity in terms of the root radius of the characteristic polynomial. The full proof for 
this theorem is somewhat long and thus provided in Section [C.l 

Theorem 4 Let A be a p-SCLI optimization algorithm overW^ and let fA,h{^) £ , (S C 

be a quadratic function. Then, there exists X^ G such that 

TCa (e,/A,b(x),T°) = n (^^ln(l/e)) 

and for all X^ G it holds that 

ICa (e,/^,b(x),T0) = o(^^ln(l/e)) 

where p denotes the root radius of the characteristic polynomial evaluated at X = A. 

We remark that the constants in the asymptotic behavior above may depend on the quadratic 
function under consideration, and that the logarithmic terms depend on the distance of the 
initialization points from the minimizer, as well as the lifting factor and the spectrum of 
the second-order derivative. For the sake of clarity, we shall usually omit the dependency 
on the initialization points. 
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There are two, rather subtle, issues regarding the definition of iteration complexity which 
we would like to address. First, observe that in many cases a given point x G is said to 
be e-optimal w.r.t some real function / : —)• M if 

/(x) < min /(x) + e 

xeK'i 

However, here we employ a different measure for optimality. Fortunately, in our case either 
can be used without essentially affecting the iteration complexity. That is, although in 
general the gap between these two definitions can be made arbitrarily large, for L-smooth 
/x-strongly convex functions we have 

|||x-x-||2</(x)-/(x-)<^||x-x-f 

Combining these two inequalities with the fact that the iteration complexity of p-SCLIs 
depends logarithmically on 1/e implies that in this very setting these two distances are 
interchangeable, up to logarithmic factors. 

Secondly, here we measure the sub-optimality of the fc’th iteration by ||E[x*^ — x*]||, 
whereas in many other stochastic settings it is common to derive upper and lower bounds 
on E [||x^ — x*||]. That being the case, by 


-1 

><! 

?!- 

1 

X 

* 

2 ' 

= E 

X 

1 

-K 

X 

2 ' 


E 

x^ - x*j 

. 


. 







we see that if the variance of the fe’th point is of the same order of magnitude as the 
norm of the expected distance from the optimal point, then both measures are equivalent. 


Consequently, our upper bounds imply upper bounds on E 


Ix'^ -X* 


for deterministic 


algorithms (where the variance term is zero), and our lower bounds imply lower bounds 
121 

, for both deterministic and stochastic algorithms (since the variance is 


on E 


|x^ -X* 


always non-negative). We defer a more adequate treatment for this matter to future work. 


3. Deriving Bounds for p-SCLI Algorithms 

The goal of the following section is to show how the framework of p-SCLI optimization al¬ 
gorithms can be used to derive lower and upper bounds. Our presentation follows from the 
simplest setting to the most general one. case to th First, we present a useful characteriza¬ 
tion of consistency (see Definition of p-SCLIs using the characteristic polynomial. Next, 
we demonstrate the importance of consistency through a simplified one dimensional case. 
This line of argument is then generalized to hnite dimensional spaces and is used to explain 
the role of the inversion matrix. Finally, we conclude this section by providing a schematic 
description of this technique for the most general case which is used both in Section Q to 
establish lower bounds on the convergence rate of p-SCLIs with diagonal inversion matrices, 
and in Section ([^ to derive efficient p-SCLIs. 
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3.1 Consistency 

Closely inspecting various specifications for p-SCLI optimization algorithms (see Section 


(2.3)) reveals that the coefficient matrices always sum up to I + EA^(X)X, where N{X) 
denotes the inversion matrix. It turns out that this is not a mere coincidence, but an 
extremely useful characterization for consistency of p-SCLIs. To see why this condition 
must hold, suppose .4, is a deterministic p-SCLI algorithm over whose coefficient matrices 
and inversion matrix are C'o(X), ..., Cp-i{X) and N{X), respectively, and suppose that A 
is consistent w.r.t some A G 5'^(S). Recall that every p+1 consecutive points generated 


by A are related by (14) as follows 


p-i 


^ Cj(4)x^-P+^' + A^(^)b, /t = + 1,... 

i=o 

Taking limit of both sides of the equation above and noting that by consistency 


X 


^ -A-^h 


for any b G M'^, yields 


Thus, 


p-i 

-A-^h = -Y^ Cj{A)A-^h + N{A)h 
j=o 


p-i 


- 4-1 = -Yj Cj{A)A-^ + N{A) 

j=0 


Multiplying by A and rearranging, we obtain 


p-i 

^CjiA) = Id + N{A)A 

j=0 


(16) 


On the other hand, if instead of assuming consistency we assume that A generates a conver¬ 
gent sequence of points and that Equation (16) holds, then the arguments used above show 
that the limit point must be —^-^b. In terms of the characteristic polynomial of p-SCLIs, 
this formalized as follows. 

Theorem 5 (Consistency - System Polynomials) Suppose A = {C{X, X), N{X)) is a 
p-SCLI optimization algorithm. Then, A is consistent with respect to A € 5‘i(S) if and 
only if the following two conditions hold: 


1. £^(1,4) = -EiV(4)4 

2. pa(£(A,4)) < 1 


(17) 

(18) 


The proof for the preceding theorem is provided in Section C.2 This result will be used 
extensively throughout the reminder of this work. 
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3.2 Simplified One-Dimensional Case 

To illustrate the significance of consistency in the framework of p-SCLIs, consider the fol¬ 
lowing simplified case. Suppose ^ is a deterministic 2-SCLI optimization algorithm over 
Q^([/i, L]), such that its inversion matrix N{x) is some constant scalar i/ G M and its co¬ 
efficient matrices co(x),ci(x) are free to take any form. The corresponding characteristic 
polynomial is 

C{\,x) = — ci(x)A — co(x) 

Now, let fa,b{x) G L]) be a quadratic function. By Theorem]^ we know that A 

converges to the minimizer of fa,b{x) with an asymptotic geometric rate of p\{C{X, a)), the 
maximal modulus root. Thus, ideally we would like to set Cj{x) = 0, j = 0,1. However, 
this might violate the consistency condition according to which, one must maintain 

£(1, a) = —ua 

That being the case, how little can p\ (£(A,a)) be over all possible choices for Cj{a) which 
satisfy C{l,a) = —val Formally, we seek to solve the following minimization problem. 

p* = min{p;^(£(A,a)) | C{\,a) is a real monic quadratic polynomial in A and C{1) = —va} 


By consistency we also have that p* must be strictly less than one. This readily implies 
that —va > 0. In which case. Lemma below gives 

p* > P (^(A — 1 — = \yj—va — l| (19) 

The key ingredient here is that u cannot be chosen so as to be optimal for all Q}{\p,, L\), at 
one and the same time. Indeed, the preceding inequality holds in particular for a = p, and 
a = L, by which we conclude that 


p* > max 1 1 ^/—vp — 11 , \/—vL ~ 1 | 


> 


K — 1 


K -b 1 


( 20 ) 


where k = T/p. Plugging in Inequality (20) into Theorem implies that there exists 
fa,b{x) G Q}{\p,L\) such that the iteration complexity of A for minimizing it is 


To conclude, by applying this rather natural line of argument we have established a lower 
bound on the convergence rate of any 2-SCLI optimization algorithms for smooth and 
strongly convex function over M, e.g., AGD and HB. 


3.3 General Case and the Role of the Inversion Matrix 

We now generalize the analysis shown in the previous simplified case to any p-SCLI opti¬ 
mization algorithm over any finite dimensional space. This generalization relies on a useful 
decomposability property of the characteristic polynomial, according to which deriving a 
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lower bound on the convergence rate of p-SCLIs over is essentially equivalent for deriving 
d lower bounds on the maximal modulus of the roots of d polynomials over M. 


Let A = {C{X, X), N{X)) be a consistent deterministic p-SCLI optimization algorithm 


and let /A,b(x) G Q“(S) be a quadratic function. By consistency (see Theoremwe have 

C{l,A) = -NA 

(for brevity we omit the functional dependency on X). Since coefficient matrices are as¬ 
sumed to be simultaneously triangularizable, there exists an invertible matrix Q G 
such that 

Tj = Q-^CjQ, j = 0, 

are upper triangular matrices. Thus, by the definition of the characteristic polynomial 
(Definition]^ we have, 

/ p-i 

det £(A, X) = det X)Q) = det j I^X^ — pi) 


j=0 




where 


p-i 


ij{X) = XP-Y^a^X^ 


( 22 ) 


k=0 


and where aj,..., cr^, j = 0,... ,p — 1 denote the elements on the diagonal of Tj, or equiv¬ 
alently the eigenvalues of Cj ordered according to Q. Hence, the root radius of the charac¬ 
teristic polynomial of A is 


p\{C{X, X)) = max{| A | | ii{X) = 0 for some i G [d]} 


(23) 


On the other hand, by consistency condition (17) we get that for all i G [d] 


ii{l) = aiiCil)) = ai{-NA) (24) 

It remains to derive a lower bound on the maximum modulus of the roots of ii{X), subject 
to the constraint (24). To this end, we employ the following lemma whose proof can be 
found in Section IC.31 

Lemma 6 Suppose q{z) is a real monic polynomial of degree p. If < 0, then 

p{q{z)) > 1 

Otherwise, if q{l) > 0, then 

p{q{z)) > 1 

In which ease, equality holds if and only if 

q{z) = {z-il¬ 


ls 










We remark that the second part of Lemma implies that subject to constraint ( |24[ ), the 
lower bound stated above is unimprovable. This property is used in Section where we 
aim to obtain optimal p-SCLIs by designing ij{X) accordingly. Clearly, in the presence of 


additional constraints, one might be able to improve on this lower bound (see Section 4.2). 


Since A is assumed to be consistent. Lemmaimplies that a{—N{A)A) C M+, as well 
as the following lower bound on the root radius of the characteristic polynomial. 


p\{£{X, X)) > max 
ie[d\ 


Va,i-N{A)A) - 1 


(25) 


Noticing that the reasoning above can be readily applied to stochastic p-SCLI optimization 
algorithms, we arrive at the following corollary which combines Theorem]^ and Inequality 

@. 

Corollary 7 Let A be a consistent p-SCLI optimization algorithm with respect to some 
A G 5'^(S), let N{X) denote the corresponding inversion matrix and let 


P 


* 


max 

iG[d] 


^ai{-EN{A)A) - 1 


then the iteration complexity of A for any /A,b(x) G Q'^(S) is lower bounded by 

p* 

Using Corollary we are now able to provide a concise ’plug-and-play’ scheme for 
deriving lower bounds on the iteration complexity of p-SCLI optimization algorithms. To 
motivate this scheme, note that the effectiveness of the lower bound stated in Corollary 
is directly related to the magnitude of the eigenvalues of —N{X)X. To exemplify this, 
consider the inversion matrix of Newton method (see Section [2. 3[ ) 

N{X) = -X-^ 


Since 

a{-N{X)X) = {1} 

the lower bound stated above is meaningless for this case. Nevertheless, the best computa¬ 
tional cost for computing the inverse oidxd regular matrices known today is super-quadratic 
in d. As a result, this method might become impractical in large scale scenarios where the 
dimension of the problem space is large enough. A possible solution is to employ inversion 
matrices whose dependence on X is simpler. On the other hand, if N{X) approximates 
—X~^ very badly, then the root radius of the characteristic polynomial might get too large. 
For instance, if N{X) = 0 then 


a{-N{X)X) = {0} 


contradicting the consistency assumption, regardless of the choice of the coefficient matrices. 
In the light of this, many optimization algorithms can be seen as strategies for balancing the 
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computational cost of obtaining a good approximation for the inverse of X and executing 
large number of iterations. Put differently, various structural restrictions on the inversion 
matrix yield different a{—N{X)X), which in turn lead to a lower bound on the root radius 
of the corresponding characteristic polynomial. This gives rise to the following scheme: 


Scheme 1 

Lower bounds 


Parameters: 

• A family of quadratic functions Q‘^(S) 

• An inversion matrix N(X) 

• A lifting factor p G N, 

Choose 

5' C 


Verify 

VA G 5', 

a{-KN{A)A) C (0,- 

2P) for consistency 

Bound 

max 

A^S',i£[d] 

^ai{-EN{A)A) - 1 

from below by some p* G [0,1) 

Lower bound: 




This scheme is implicitly used in the previous Section (3.2), where we established a lower 
bound on the convergence rate of 2-SCLI optimization algorithms over M with constant 
inversion matrix and the following parameters 

S = [/i,L], S' = {fi,L} 

In Section]^ we will make this scheme concrete for scalar and diagonal inversion matrices. 


3.4 Bounds Schemes 

In spite of the fact that Scheme 1 is expressive enough for producing meaningful lower 
bounds under various structures of the inversion matrix, it does not allow one to incorpo¬ 
rate other lower bounds on the root radius of characteristic polynomials whose coefficient 
matrices admit a particular form, e.g., linear coefficient matrices (see |35| below). Abstract¬ 
ing away from Scheme 1, we now formalize one of the main pillar of this work, i.e., the 
relation between the amount of computational cost one is willing to invest in executing 
each iteration and the total number of iterations needed for obtaining a given level of ac¬ 
curacy. We use this relation to form two schemes for establishing lower and upper bounds 
for p-SCLIs. 


Given a compatible set of parameters: a lifting factor p, an inversion matrix N(X), set of 
quadratic functions and a set of coefficients matrices C, we denote by 2l(p, N{X), , C) 

the set of consistent p-SCLI optimization algorithms for Q'^(S) whose inversion matrix 
are N{X) and whose coefficient matrices are taken from C. Furthermore, we denote by 
S,{p, N{X), ,C) the following set of polynomial matrices 


p-i 


£(A, A) = IdXP - ^ECj(A)A^' 

j=0 


Cj{X)GC, C{1,A) = -N{A)A, 
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Since both sets are determined by the same set of parameters, the specifications of which 
will be occasionally omitted for brevity. The natural one-to-one correspondence between 
these two set, as manifested by Theorem]^ and Corollaryyields 


min max n\(CA{\,A)) 
^621 


min max 
£{A,X)e£ AecS-^lS) 


Px{C{\A)) 


(27) 


The importance of Equation (27) stems from its ability to incorporate any bound on the 
maximal modulus root of polynomial matrices into a general scheme for bounding the 
iteration complexity of p-SCLIs. This is summarized by the following scheme. 


Scheme 2 

Lower bounds 

Given 

a set of p-SGLI optimization algorithms 2l(p, AI(A), ,C) 

Find 

p* G [0,1) such that 


min max px (C(X, A)) > 


£(A,X)e£ AeSd{T.) 

Lower bound: 

^ (rffc ln(l/£)) 


Thus, Scheme 1 is in effect an instantiation of the scheme shown above using Lemma 
This correspondence of p-SCLI optimization algorithms and polynomial matrices can be 
also used contrariwise to derive efficient algorithm optimization. Indeed, in Section [2.3| we 
show how FGD, HB and AGD can be formed as optimal instantiations of the following dual 
scheme. 


Scheme 3 Optimal p-SCLI Optimization Algorithms 

Given a set of polynomial matrices £(p, N{X), ,C) 

Compute P* = niin max p\(Ci\,A)) 

C{\,X)&S, A&S’XE) 

and denote its minimizer by C* (A, A) 

Upper bound: The corresponding p-SGLI algorithm of C* (A, A) 

Convergence rate: ln(l/e)^ 


4. Lower Bounds 

In the sequel we derive lower bounds on the convergence rate of p-SCLI optimization algo¬ 
rithms whose inversion matrices are scalar or diagonal, and discuss the assumptions under 
which these lower bounds meet matching upper bounds. It is likely that this approach can 
be also effectively applied for block-diagonal inversion, as well as for a much wider set of 
inversion matrices whose entries depend on a relatively small set of entries of the matrix to 
be inverted. 
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4.1 Scalar and Diagonal Inversion Matrices 


We derive a lower bound on the convergence rate of p-SCLI optimization algorithms for 
L-smooth ;U-strongly convex functions over with a scalar inversion matrix N{X) by em¬ 
ploying Scheme 1 (see Section 3.3). Note that since the one-dimensional case was already 
proven in Section [3.2| we may assume that d>2. 


First, we need to pick a ‘hard’ matrix in 5'^([/i, L]). It turns out that any positive-definite 
matrix A G L]) for which 


C a{A) 

will meet this criterion. For the sake of concreteness, let us define 

A = Diag(L, /r, .^. ,/xJ 

d—1 times 


(28) 


In which case, 


= a(-EN(A)A) 

where ly = E[A^(A)]. Thus, to maintain consistency, it must hold thalj^ 


ly G 


-2P 


L 


-,o 


(29) 


Next, to bound from below 


p* = max 
i&{d] 


^ai{-iyA) - 1 


= max 


—ly^ — l\,\ ^—vL “ 1| I 


we split the feasible range of (29) into three different sub-ranges as follows: 



^-lyfjL - 1 < 0 

- 1 > 0 

< 0 

Case 1 

Range: [—1/L,0) 
Minimizer: ly* = —l/L 

Lower bounds: 1 — 

N/A 

> 0 

Case 2 

Case 3 (requires: p > log 2 n) 
Range: (—2P/L, —1/p] 

Minimizer: — 1/p 

Lower Bound: f/— — 1 

V fy 

Range: (—1/p,—l/L) 

f 2 \P 

iUimmizcr. 

Lower bound: ^ 


Table 1: Lower bound for p* by subranges of v 


9. On a side note, this reasoning also implies that if the spectrum of a given matrix A contains both positive 
and negative eigenvalues then A~^b cannot be computed using p-SCLIs with scalar inversion matrices. 


22 
























Therefore, 


(30) 


^ ■ ,1 

> mm < 1 - {/ T) 


^ L / p +1 y ^ 


- - 1 I = ^ 


1 


Where k = L/p, upper bounds the condition number of functions in Q^([p, L]). Thus, by 
Scheme 1 we get the following lower bound on the worse-case iteration complexity. 


n 


ln(l/e; 


(31) 


As for the diagonal case, it turns out that for any quadratic /y 4 ,b(x) G Q'^{[p, L]) which has 


L —/i 

2 2 

L—(1 L-\-fi 

~2~ ~2~ 


(32) 


as a principal sub-matrix of A, the best p-SCLI optimization algorithm with a diagonal 
inversion matrix does not improve on the optimal asymptotic convergence rate achieved by 


scalar inversion matrices (see Section C.4). Overall, we obtain the following theorem. 


Theorem 8 Let A he a consistent p-SCLI optimization algorithm for L-smooth p-strongly 
convex functions over If the inversion matrix of A is diagonal, then there exists a 
quadratic function /yi,b(x) G Q'^{[p,L]) such that 


ACj\^ie,fA,h{^)) = ^ ln(l/e; 


(33) 


where k = L/p. 


4.2 Is This Lower Bound Tight? 

A natural question now arises: is the lower bound stated in Theorem tight? In short, 
it turns out that for p = 1 and p = 2 the answer is positive. For p > 2 the answer 
heavily depends on whether a suitable spectral decomposition is within reach. Obviously, 
computing the spectral decomposition for a given positive definite matrix A is at least 
as hard as hnding the minimizer of a quadratic function whose hessian is A. To avoid 
this, we will later restrict our attention to linear coefficients matrices which allow efficient 
implementation. 

A matching upper bound for p = 1 In this case the lower bound stated in Theorem 
is simply attained by FGD (see Section 2.3). 


A matching upper bound for p = 2 In this case there are two 2-SCLI optimization al¬ 
gorithm which attain this bound, namely, Accelerated Gradient Descent and The 
Heavy Ball method (see Section [2.3[ ), whose inversion matrices are scalar and corre¬ 
spond to Case 1 and Case 2 in Table i.e.. 


A^hb = - 


"v/A -I- ^/p 


Nagd = -j-It 
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Although HB obtains the best possible convergence rate in the class of 2-SCLIs with 
diagonal inversion matrices, it has a major disadvantage. When applied on general 
smooth and strongly-convex functions, one cannot guarantee global convergence. That 
is, in order to converge correctly, HB must be initialized close enough to the minimizer 
(see Section 3.2.1 in Polyak (1987)). Indeed, if the initialization point is too far from 
the minimizer then HB may diverge as shown in Section 4.5 in Lessard et al. (2014). 
In contrast to this, AGD attains a global linear convergence with a slightly worse 
factor. Put differently, the fact HB is highly adapted to quadratic functions prevents 
it from converging globally to the minimizers of general smooth and strongly convex 
functions. 


A matching upper bound for p > 2 In Subsection we show that when no restriction 
on the coefficient matrices is imposed, the lower bound shown in Theorem]^ is tight, 
i.e., for any p G N there exists a matching p-SCLI optimization algorithm with scalar 
inversion matrix whose iteration complexity is 

0(^ln(l/e)) (34) 

In light of the existing lower bound which scales according to -y/It, this result may seem 
surprising at first. However, there is a major flaw in implementing these seemingly 
ideal p-SCLIs. In order to compute the corresponding coefficients matrices one has 
to obtain a very good approximation for the spectral decomposition of the positive 
definite matrix which defines the optimization problem. Clearly, this approach is 
rarely practical. To remedy this situation we focus on linear coefficient matrices 
which admit a relatively low computational cost per iteration. That is, we assume 
that there exist real scalars ai,..., Up-i and /3i,..., fip-i such that 

Cj{X) = ajX +/3jld, j = 0,1 ,... ,p-1 (35) 

We believe that for these type of coefficient matrices the lower bound derived in 
Theorem is not tight. Precisely, we conjecture that for any 0 < p < L and for any 
consistent p-SCLI optimization algorithm A with diagonal inversion matrix and linear 
coefficients matrices, there exists /A,b(x) G Q'^([p,L]) such that 


Pa(£^(A,A)) > 


At — 1 


+1 


where k = L/p. Proving this may allow to derive tight lower bounds for many 
optimization algorithm in the field of Machine Learning, such as SAC (see Section 


2.3), whose structure is often very close to that of p-SCLIs with linear coefficients 


matrices. By Scheme 2, this conjecture may be equivalently stated as follows: suppose 
q{z) is ap-degree monic real polynomial such that ( 7 ( 1 ) = 0. Then, for any polynomial 
r{z) of degree p — 1 and for any 0 < p < L, there exists p G [p, L] such that 


p{qiz) - r]r{z)) > 


1 

\/L/p -|- 1 
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That being so, can we do better if we allow families of quadratic functions Q‘^(S) 
where S are not necessarily continuous intervals? It turns out that the answer is 
positive. Indeed, in Section]^ we present a 3-SCLI optimization algorithm with linear 
coefficient matrices which, by being intimately adjusted to quadratic functions whose 
hessian admits large enough spectral gap, beats the lower bound of Nemirovsky and 
Yudin Q. This apparently contradicting result is also discussed in Section]^ where 
we show that lower bound ([^ is established by employing quadratic functions whose 
hessian admits spectrum which densely populates [/i, L\. We would like to stress that 
as useful as such optimization algorithm might be, it is provided only for the purpose 
of demonstrating the detailed analysis which this framework allows and for indicating 
that applications which exhibit spectrum that distribute differently in [(i, L] might 
admit faster general solvers than what is dictated by lower bound Q. 


5. Upper Bounds 


Up to this point we have projected various optimization algorithms on this framework of p- 
SCLI optimization algorithms, thereby converting questions on convergence properties into 
questions on moduli of roots of polynomials. In what follows, we shall head in the opposite 
direction. That is to say, first we define a polynomial (see Definition Q) which meets 
a prescribed set of constraints, and then we form the corresponding p-SCLI optimization 


algorithm. As stressed in Section 4.2, we will focus exclusively on linear coefficient matrices 


which admit a low per-iteration computational cost and allow a straightforward extension to 
general smooth and strongly convex functions. Surprisingly enough, this allows a systematic 
recovering of FGD, HB, AGD, as well as establishing new optimization algorithms which 
allow better utilization of second-order information. This line of inquiry is particularly 
important due to the obscure nature of AGD, and further emphasizes its algebraic char¬ 
acteristic. We defer stochastic coefficient matrices, as in SDCA, (Section|2.1|) to future work. 


This section is organized as follows: First we apply Scheme 3 to derive general p-SGLIs 
with linear coefficients matrices; Next, following this line of argument, we recover AGD and 
HB as optimal instantiations under this setting; Finally, although general p-SGLI algorithms 
are exclusively specified for quadratic functions, we show how p-SGLIs with linear coefficient 
matrices can be extended to general smooth and strongly convex functions. 


5.1 Linear Coefficient Matrices 


In the sequel we instantiate Scheme 3 (see Section 3.4) for Clii 
linear coefficient matrices. 


f, the family of deterministic 


First, note that due to consistency constraints, inversion matrices of constant p-SGLIs 
with linear coefficient matrices must be either constant scalar matrices or else be com¬ 
putationally equivalent to A~^. Therefore, since our motivation for resorting to linear 
coefficient matrices was efficiency, we can safely assume that N{X) = for some v G 
(—2^/L,0). Following Scheme 3, we now seek the optimal characteristic polynomial in 
£,{p,iyld, Q'^([p,L]) ,CLinear) with a compatible set of parameters (see Section 3.4). In the 
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presence of linearity, the characteristic polynomials takes the following simplified form 


By (23) we have 


p-i 

£(A, X) = \P - ^{ajX + bjId)X^, aj,bj G M 

j=0 


px{C{X,X)) = max{|A| | 3i G [d], ii{X) = 0} 


where ii{X) denote the factors of the characteristic polynomial as in (22). That is, denoting 
the eigenvalues of JT by oi,..., <7^ we have 


p-i 

i(X) = X^ - + bj)X^ = ^ ^ 

j=0 


p-i 

:E' 

j=0 


p-i 

E- 

i=o 


Thus, we can express the maximal root radius of the characteristic polynomial over L]) 

in terms of the following polynomial 


£(A, T]) = XP - {r]a{X) + 6(A)) 


(36) 


for corresponding real univariate p 


1 degree polynomials a(A) and 6(A), whereby 


max px(C(X,A))= max p(i(X,p)) 
A&SdiT.) »?e[At,L] 


That being the case, finding the optimal characteristic polynomial in £ translates to the 
following minimization problem, 


minimize max px(i(X,p)) 


£(A,r?)e£ 

r]£[p,L] 


S.t. 


(37) 


< 1 

(38) 


(Note that in this case we think of £ as a set of polynomials whose variable assumes 
scalars). Let us calculate the optimal characteristic polynomial for the setting where the 
lifting factor is p = 1, the family of quadratic functions under considerations is Q'^{[p,L]) 
and the inversion matrix is N{X) = uld, v G (—2/L,0). In which case (36) takes the 
following form 


l{X,p) = X - rjao - bo 


where ao,6o are some real scalars, 
other choice but to set 


In order to satisfy (37) for all p G [p,L], we have no 


ao = v, 6o = 1 


which implies 


Px{i{X,p)) = 1 + 1 / 7 / 


26 







Since v G (—2/L,0), condition 38 follows, as well. The corresponding 1-SCLI optimization 
algorithm is 


= (/ + + i/b 


and its first-order extension (see Section 5.3 below) is precisely FGD (see Section 2.3). 
Finally, note that the corresponding root radius is bounded from above by 


K 

K + 1 

for V = —IjL, the minimizer in Case 2 of Table and by 

K — 1 
K + 1 

for u = ) the minimizer in Case 3 of Table This proves that FGD is optimal for 

the class of 1-SCLIs with linear coefficient matrices. Figure |5.1| shows how the root radius 
of the characteristic polynomial of FGD is related to the eigenvalues of the hessian of the 
quadratic function under consideration. 



Figure 1: The root radius of FGD vs. various eigenvalues of the corresponding hessian. 


5.2 Recovering AGD and HB 

Let us now calculate the optimal characteristic polynomial for the setting where the lifting 
factor is p = 2, the family of quadratic functions under considerations is Q'^([/r, L\) and the 
inversion matrix is N{X) = uld, G (—4/L, 0) (recall that the restricted range of v is due 
to consistency). In which case (36) takes the following form 

f'(A, 7?) = - p(aiA -I- ao) - (6iA -|- ho) (39) 


for some real scalars ao, ai, feo, &!• Our goal is to choose oq, oi, 6o, bi so as to minimize 

max px{£i\,r])) 

ri&[ii,L] 
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while maintaining conditions (37) and (38). Note that i{X,r]), when seen as a function of 
r], forms a linear path of quadratic functions. Thus, a natural way to achieve this goal is 
to choose i{X, rj) so that £{X, fi) and i{X, L) take the form of the ’economic’ polynomials 
introduced in Lemma namely 

{X-{l-V~r)f 

for r = and r = —vL, respectively, and hope that for others rj G the roots of 

i{X, rj) would still be of small magnitude. Note that due to the fact that i{X, rj) is linear in 
rj, condition (37) readily holds for any rj G {ji, L). This yields the following two equations 

2 


/^) = (a - (1 - 

£(A,L) = (A-(l-y^)' 


Substituting (39) for i{X,r]) and expanding the r.h.s of the equations above we get 

A^ — (fli/i + 6i)A — {(iQjj, + bo) = }? — 2(1 — \/—n/r) A + (1 — \/—n/i)^ 

A^ - (aiL + 5i)A - (aoT + bo) = X^ - 2(1 - y/-vL)X + (1 - ^/-uL)"^ 

Which can be equivalently expressed as the following system of linear equations 

-{aiji + bi) = -2(1 - 'J-i'jj) 

-(aoAi + bo) = (1 - 

— (oiL + bi) = —2(1 — y/—uL) 

— (aoL + bo) = (1 — y/—uL)"^ 


(40) 

(41) 

(42) 

(43) 


Multiplying Equation (40) by -1 and add to it Equation (42). Next, multiply Equation 
by -1 and add to it Equation (|43|) yields 


ai{jx - L) = 2y/XX^{VL - y/Ji) 

aoiji - L) = {1- yf-vL)"^ - (1 - y/-^)"^ 


Thus, 


Ol = 


-2y/^ 

y/JI + y/L 


ao — 


2y/^ 

y/Jr + y/L 


+ y 


Remarkably enough, plugging in v = —1/L (see Table into the equations above and 
solving for 6i and bo yields a 2-SCLI optimization algorithm whose extension is precisely 
AGD (see Section 2.3). Eollowing the same derivation only this time by setting 

/ \ 2 

I 2 \ 

V = — 


^ '/L + y/~/ ^ 


yields the Heavy-Ball method. 
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Figure 2: The root radius of AGD and HB vs. various eigenvalues of the corresponding 
hessian. 


Moreover, using standard formulae for roots of quadratic polynomials one can easily 
verify that 

PA (^(A,?7)) < rj e [n,L] 

for AGD, and 


Pa (^(A,p)) < 


P G [p, L] 


for HB. In particular. Condition 38 holds. Figure 5.2 shows how the root radii of the 
characteristic polynomials of AGD and HB are related to the eigenvalues of the hessian of 
the quadratic function under consideration. 


5.3 First-Order Extension for p-SCLIs with Linear Coefficient Matrices 

As mentioned before, as coefficient matrices of p-SCLIs can take any form, it is not clear 
how to use a given p-SCLI algorithm, efficient as it may be, for minimizing general smooth 
and strongly convex functions. That being the case, one could argue that recovering the 
specifications of, say, AGD for quadratic functions does not necessarily imply how to recover 
AGD itself. Fortunately, consistent p-SGLIs with linear coefficients can be reformulated as 
optimization algorithms for general smooth and strongly convex functions in a very natural 
way by substituting V/(x) for Ax + b, while preserving the original convergence properties 
to a large extent. In the sequel we briefly discuss this appealing property, namely, canonical 
first-order extension, which completes the path from the world of polynomials to the world 
optimization algorithm for general smooth and strongly convex functions. 
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Let A = (£^(A, X), N(X)) be a consistent p-SCLI optimization algorithm with a scalar 
inversion matrix, i.e., N{X) = vld, v G (—2^/L,0), and linear coefficient matrices 

Cj{X) = ajX j = 0, (44) 

where qq, ■ ■ ■, flp-i G M and bg, ■ ■ ■, bp-i G M denote real scalars. Recall that by consistency, 
for any /A,b(x) G Q'^(£) it holds that 


p-i 

Y,Cj{A) =1 + 1. a 

j=o 


Thus 


p-i 


p-i 


^ = 1 and 


ai = u 


(45) 


j=0 j=0 

By the very definition of p-SCLI optimization algorithms (Definition [^, we have that 
= Co(A)x^-P + Ci(^)x^-(P-i) + • • • + Cp_i(A)x^-i + uh 
Substituting Cj{A) for ( [44| ), gives 

= {aoA + 6o)x^-P + {aiA + 6i)x^-(p-i) + • • • + (ap.iA + 6p_i)x'=-i + uh 


Rearranging and plugging in 45 we get 

x^ = ao(^x^"^ + b) + + b) + • • • + ap.i{A^^-^ + b) 

+ bgx’^-P + 6ix^-(P-^) + • • • + 6p_ix^-i 

Finally, by substituting Ax + b for its analog V/(x), we arrive at the following canonical 
hrst-order extension of A 


p—1 

x*^ = + ^ajV/(x''-(P-^)) 

j=0 j=0 


(46) 


Being applicable to a much wider collection of functions, how well should we expect the 
canonical extensions to behave? The answer is that when initialized close enough to the 
minimizer, one should expect a linear convergence of essentially the same rate. A formal 
statement is given by the theorem below which easily follows from Theorem 1 in Section 
2.1, iPolyakl (1 19871) for 


5( 


^k-p 


p—1 

xfc-1) = ^ 5^.x'=-(P-?) + ^ajV/(x^-(P-^)) 
1=0 1=0 
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Theorem 9 Suppose f : ^ M. is an L-smooth n-strongly convex function and let x* 

denotes its minimizer. Then, for every e > 0, there exist 5 > 0 and C > 0 such that if 

llx^ - x*|| <6, j = 0,...,p-l 


then 


X — X 


< W 


+ e)^ 


k = p,p + l, 


where 


p* = sup p 
»?es 


p-i \ 

+ bjW 

i=o / 


Unlike general p-SCLIs with linear coefficient matrices which are guaranteed to converge 
only when initialized close enough to the minimizer, AGD converges linearly, regardless of 
the initialization points, for any smooth and strongly convex function. This merits further 
investigation as to the precise principles which underlie p-SCLIs of this kind. 
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An appendix 


A. Optimal p-SCLI for Unconstrained Coefficient Matrices 


In the sequel we employ Scheme 3 (see Section 3.4) to show that, when no constraints are 
imposed on the functional dependency of the coefficient matrices, the lower bound shown 
in Theorem]^ is tight. To this end, recall that in Lemmawe have shown that the lower 
bound on the maximal modulus of roots of a polynomials which evaluate at z = 1 to some 
r > 0 is uniquely attained by the following polynomial 




Thus, by choosing coefficients matrices which admit the same form, we obtain the optimal 
convergence rate as stated in Theorem 


Goncretely, let p G N be some lifting factor, let N{X) = i/Id, v G {—2P/L,0) be a fixed 
scalar matrix and let /A,b(x) G Q'^(S) be some quadratic function. Lemmaimplies that 
for each rj G a{—uA) we need the corresponding factor of the characteristic polynomial to 
be 


(“ 7 ) 

k=0 ^ ^ 


This is easily accomplished using the spectral decomposition of A by 

A = U^AU 

where U is an orthogonal matrix and A is a diagonal matrix. Note that since A is a positive 
definite matrix such a decomposition must always exist. We define p coefficient matrices 
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Co, Cl,, Cp_i in accordance with Equation (47) as follows 


Ck = U 


-(fc) - 1) 


p—k 


\ 


-a)(^=^-irv 


u' 


By using Theorem it is easily verified that these coefficient matrices form a consistent 
p-SCLI optimization algorithm whose characteristic polynomial’s root radius is 


max I ^—v^j — 1 


Choosing 


2 

according to Table produces an optimal p-SCLI optimization algorithm for this set of 
parameters. It is noteworthy that other suitable decompositions can be used for deriving 
optimal p-SCLIs, as well. 




As a side note, since the cost of computing each iteration in G grows linearly with 
the lifting factor p, the optimal choice of p with respect to the condition number k yields 
a p-SCLI optimization algorithm whose iteration complexity is 0(ln(K) ln(l/e)). Clearly, 
this result is of theoretical interest only as this would require a spectral decomposition 
of A, which, if no other structural assumptions are imposed, is an even harder task than 
computing the minimizer of fA,h{^)- 


B. Lifting Factor > 3 


In Section 


4.2 


I we conjecture that for any p-SCLI optimization algorithm A = (C(X, X), N{X)), 
with diagonal inversion matrix and linear coefficient matrices there exists some A G L]) 

such that 


Pa(/:(A,X))>^^^ (48) 

+ 1 

However, it may be possible to overcome this barrier by focusing on a subclass of Q^([p, L]). 
Indeed, recall that the polynomial analogy of this conjecture states that for any monic real 
p degree polynomial q{z) such that q{l) = 0 and for any polynomial r{z) of degree p — 1, 
there exists p G [p, L] such that 


p{q{z) - r]r{z)) > 


- 1 

■\/k + 1 


where k = L/p. This implies that we may be able to tune q{z) and r(z) so as to obtain 
a convergence rate, which breaks Inequality (48), for quadratic function whose Hessian’s 
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spectrum does not spread uniformly across 


Let us demonstrate this idea for p = 3, /r = 2 and L = 100. Following the exact same 
derivation used in the last section, let us pick 

q{z,'q) = zP - {r]a{z) + b{z)) 

numerically, so that 

q{z,ix) = - (1 - 

Qiz,L) = 

where 

(W+^) 

The resulting 3-CLI optimization algorithm ^3 is 

= C2iX)x^-^ + Ci(X)x^-2 + Co(X)x*^-3 + N{X)b 

where 

Co{X) « 0.1958/rf - 0.0038X 
C'i(X) « -0.9850Id 
C2{X) « 1.7892/rf - 0.0351X 
N{X) PS -0.0389Id 

It is noteworthy that as opposed to the algorithm described in Section when employing 
linear coefficient matrices no knowledge regarding the eigenvectors of A is required. As each 
eigenvalue of the second-order derivative corresponds to a bound on the convergence rate, 
one can verify by Figure]^ that 

for any X G Q'^([2,100]) which satisfies 

a{A) C S = [2, 2 -k e] U [100 - e, 100], e ps 1.5 
Thus, ^3 outperforms AGD for this family of quadratic functions. 


Let us demonstrate the gain in the performance allowed by ^3 in a very simple setting. 
Define A to be Diag(p,L) rotated counter-clockwise by 45°, that is 




1 






ljj-\-L fi—L 
H —L ii-\-L 

~T~ ~T~ 


35 


A = fj, 







Figure 3: The convergence rate of AGD and ^3 vs. the eigenvalues of the second-order 
derivatives. It can be seen that the asymptotic convergence rate of ^3 for 
quadratic functions whose second-order derivative comprises eigenvalues which 
are close to the edges of [2,100], is faster than AGD and goes below the theoret¬ 
ical lower bound for first-order optimization algorithm . 
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Figure 4: The error rate of ^ 3 , AGD and HB vs. iterations for solving a simple quadratic 
minimization task. The convergence rate of ^3 is bounded from above by 

v ^ 

as implied by theory. 


Furthermore, define b = — A (100,100)"'". Note that /y 4 ,b(x) G and that its mini- 

mizer is simply (100,100)^. Figure shows the error of ^ 3 , AGD and HB vs. iteration 
number. All algorithms are initialized at = 0. Since ^3 is a first-order optimization 
algorithm, by the lower bound shown in (i there must exist some quadratic function 
/Tb,bib(x) G Q^([/x,L]) such that 


I-Caz (g/A lb,bib W) > ^ (\/Kln(l/e)) 


(49) 


But, since 


(G/A,b(x)) < C>(^ln(l/e)) 


(50) 


for every /A,b(x) G 2 j > we must have /Aib,bib(x) G Q^{[^i,L]) \ Q Indeed, in 

the somewhat simpler form of the general lower bound for first-order optimization algo¬ 


rithms, Nesterov (see Nesterov (2004)) considers the following 1-smooth 0-strongly convex 
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functiorF^ 


/ 2 -1 0 ... 

- 12-10 ... 
0 - 12-10 




0 \ 
0 
0 


0 

V 0 


0 - 12-1 
... 0 - 12 / 


bib = — 


/ 1 \ 
0 

V 0 


As demonstrated by Figure]^ <7(^ib) densely fills [/U,L]. 


1 r 

0.5 - 

0 — 
0 

1 r 

0.5 - 

0 — 
0 

1 r 

0.5 - 

0 — 
0 

1 r 

0.5 - 
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0.1 0.2 0.3 0.4 ,0.5 0.6 0.7 0.8 0.9 

d = 7 
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0.7 0.8 0.9 1 


Figure 5: The spectrum of ^ib, as used in the derivation of Nesterov’s lower bound, for 
problem space of various dimensions. 


Consequently, we expect that whenever adjacent eigenvalues of the second-order deriva¬ 
tives are relatively distant, one should able be to minimize the corresponding quadratic 
function faster than the lower bound stated in[^ This technique can be further generalized 
to p > 3 using the same ideas. Also, a different approach is to use quadratic (or even higher 
degree) coefficient matrices to exploit other shapes of spectra. Clearly, the applicability of 
both approaches heavily depends the existence of spectra of this type in real applications. 


10. Although /Aib.bib (x) is not strongly convex, the lower bonnd for strongly convex function is obtained by 
shifting the spectrum using a regnlarization term /i/2 ||x|/. In which case, the shape of the spectrum is 
preserved. 
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C. Proofs 

C.l Proof of Theorem [ 4 ] 

The simple idea behind proof of Theorem is to express the dynamic of a given p-SCLI 
optimization algorithm as a recurrent application of linear operator. To analyze the latter, 
we employ the Jordan form which allows us to bind together the maximal magnitude eigen¬ 
value and the convergence rate. Prior to proving this theorem, we first need to introduce 
some elementary results in linear algebra. 


C.1.1 Linear Algebra Preliminaries 

We prove two basic lemmas which allow to determine under what conditions does a recur¬ 
rence application of linear operators over finite dimensional spaces converge, as well as to 
compute the limit of matrices powers series. It is worth noting that despite of being a very 
elementary result in Matrix theory and in the theory of power methods, the lower bound 
part of the first lemma does not seem to appear in this form in standard linear algebra 
literature. 


Lemma 10 Let A be a d x d square matrix. 

• If p{A) > 0 then there exists Ca > 0 such that for any u G and for any k ^ N we 
have 




u 


< CAk^-^p{Af ||u| 


where m denotes the maximal index of eigenvalues whose modulus is maximal. 

In addition, there exists ca > 0 and r G such that for any u G which satisfies 
(u,r)/0 we have 


A* 


u 


> CAk^-^p{A) 


u 


for sufficiently large /c G N. 


• If p{A) = 0 then A is a nilpotent matrix. In which case, both lower and upper bounds 
mentioned above hold trivially for any u G for sufficiently large k. 

Proof Let P he a d x d invertible matrix such that 

P-'^AP = J 

where J is a Jordan form of A, namely, J is a block-diagonal matrix such that J = 
©1^1 Jfc. (Aj) where Ai, A 2 ,..., Xs are eigenvalues of A, whose indices are ki,..., kg, respec¬ 
tively. w.l.o.g we may assume that | Ai | = p{A) and that the corresponding index, which 
we denote by m, is maximal over all eigenvalues of maximal magnitude. Let Qi, Q 2 ) •'' ^Qs 
and denote partitioning of the columns of P and the rows of P~^, respec¬ 

tively, which conform with the Jordan blocks of A. 
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Note that for all i G [d], JkiiO) is a nilpotent matrix of an order ki. Therefore, for any 
{Xi,ki) we have 


=E(-)T'A.(«y 

j=o 

= EG)Tv..(or 

j=o 


Thus, 




j=0 
h-1 (k\ 

E 

j=0 


k^-'^X’l 


^y jk^joy 

\XiJ 


(51) 


The rest of the proof pivots around the following equality which holds for any u G 


A’^u 


PJ'^P-^U 


Y,Q^JkM^yR^u 


i=l 

fn—l A\k 


= k'^-^p{Ay 


s 

J2Qi {jkyxi)/ik^-^x’^y Riu 

i=l 


(52) 


Plugging in [sT] yields. 




= K^-^piAy 


Eft 

1=1 

Af (‘) 

{X^yjk^oy] 

RiU 

J^m-l 

\j=o 

K ) 

Wfc 


(53) 


Let us denote the sequence of vectors in the l.h.s of the preceding inequality by 
Showing that the norm of is bounded from above and away from zero will conclude 

the proof. Deriving an upper bound is straightforward. 


Iwfcll < 


2=1 


' kj — l {k\ 




k^-^ \X 


1/ 

k\ 


siiuiiEiiOiiiiiftiiE 

i=l j=0 






(54) 
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Since for all i e [d] we have 



j^m—1 



or 






1 


it holds that Inequality ( [54) ) can be bounded from above by some positive scalar Ca- Plug¬ 
ging it in into yields 




< CAk'^-^p{Af ||u 


Deriving a lower bound on the norm of {w^} is a bit more involved. First, we define the 
following set of Jordan blocks which govern the asymptotic behavior of ||wfc|| 


X = {i G [s] I I Aj I = p{A) and ki = m} 


Equation (51) implies that for all i ^ X 

0 as /c —)■ oo. 


As for i G X, the first fej — 1 terms in Equation (51) tend to zero. The last term is a matrix 
whose entries are all zeros, except for the last entry in the first row which equals 


L-i) (h 


k 


ra—1 


y ) i/(Ar^)^(T^) v(Ar^) 


A* 


Ai 


By denoting the first column of each Qi by qi and the last row in each Ri by r- , we get 

k 

aJ a? 


|Wfc| 


X/ ( \, 1 ^RiU 


iex 


E 

iex 


Ai 


\ \ ^ T 

Xi \ qir- u 


a: 


m—1 


Now, if u satisfies r^u 7 ^ 0 then since qi,q2, • • • , q\x\ are linearly independent, we see that 
the preceding can be bounded from below by some positive constant ca > 0 which does 
not depend on k. That is, there exists ca > 0 such that ||wfc|| > ca for sufficiently large k. 


Plugging it in into Equation (53) yields 


|^u|| > p{A) ||u 


for any u G such that (u, ri) 7 ^ 0 and for sufficiently large k. 


The following is a well-known fact regarding Neuman series, sum of powers of square 


matrices, which follows easily from Lemma 10 
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Lemma 11 Suppose A is a square matrix. Then, the following statements are equivalent: 


1. p{A) < 1 . 

2 . limfc^oo = 0. 

3. converges. 

In which case, (/ — A)~^ exists and (/ — A)~^ = ■ 


Proof First, note that all norms on a finite-dimensional space are equivalent. Thus, the 
claims stated in Q and Q are well-defined. 

The fact that 
equivalence of 



are equivalent is a direct implication of Lemma 10 
may be established using the following identity 


Finally, the 


m —1 

{I - A)^ A^ = I - A^^ 

k=0 


m G N 


C.1.2 Convergence Properties 

Let us now analyze the convergence properties of p-SCLI optimization algorithms. First, 
note that update rule (14) can be equivalently expressed as a single step rule by introducing 
new variables in some possibly higher-dimensional Euclidean space 


z0= fx0,x\ 




= M{X)z'‘-^ + UN{X)h, A: = 1,2,... (55) 


where 


u ^{0d,...,0d,idV e 

^^ 

p—1 times 


(56) 


and where M{X) is a mapping from to valued random variables which admits 

the following generalized form of companion matrices 


/ Od 


\ Co{X) 


Id 

Od Id 


\ 


Od Id 

Cp.2{X) Cp_i(X) / 


(57) 


Following the convention in the field of linear iterative methods, we call M{X) the iteration 
matrix. Note that in terms of the formulation given in (55), consistency w.r.t A G 5'^(S) is 
equivalent to 




(58) 


p times 
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regardless of the initialization points and for any b £ and z® £ 


To improve readability, we shall omit the functional dependency of the iteration, inver¬ 
sion and coefficient matrices on X in the following discussion. Furthermore, Equation (55) 
can be used to derive a simple expression of z^, in terms of previous iterations as follows 


= M(°)z° + UN^^'>h 
z^ = 

z^ = m(2)z2 + + UN^‘^')h 


z 


k 


k-1 


k—1 k—1 


n +E n 

j=0 rn=l j=m 


k—1 k 

j=0 m=l 



[/Ar("i-i)b 


where ,..., are k i.i.d realizations of the corresponding iter¬ 

ation matrix and inversion matrix, respectively. We follow the convention of defining an 
empty product as the identity matrix and defining the multiplication order of factors of 
abbreviated product notation as multiplication from the highest index to the lowest, i.e., 
0^=1 Taking the expectation of both sides yields 


' k-l 


Ez^ = E[M]*^z° + E[C/iVb] 


(59) 


vf=0 


By Lemma 11, if p(EM) < 1 then the first term in the r.h.s of Equation (59) vanishes for 
any initialization point z*^, whereas the second term converges to 


[I - EM)-%UNh] 


the fixed point of the update rule. On the other hand, suppose that (Ez^)^^ converges 
for any z^ £ Then, this is also true for z® = 0. Thus, the second summand in the r.h.s 
of Equation (59) must converge. Consequently, the sequence E[M]^z‘^, being a difference of 
two convergent sequences, converges for all z^, which implies p{K[M]) < 1. This proves the 
following theorem. 


Theorem 12 With the notation above, (Ez^)^^ converges for any if and only if 

p{K[M]) < 1. In which case, for any initialization point z^ £ the limit is 

z* = {I- EM)~^ E[[/iVb] (60) 
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We now address the more delicate question as to how fast do p-SCLIs converge. To this 
end, note that by Equation (59) and Theorem 12 we have 

/k-l \ 


E 




E[M]''z° + ^E[M]* E[i7A^b] - {I - EM)~^E[UNh] 


^l=0 




k-l 


E[M]*^z° + (/ - EM)~^ (/ - EM) E[MY - I E[i7A^b] 


z=o 


= E[M]''z° -{I - EMY {EM)’^E[UNh] 
= E[Mf{z^-z*) 


(61) 


Hence, to obtain a full characterization of the convergence rate of ||E [z*^ — z*] || in terms of 
p(EM), all we need is to simply apply Lemma 10 with EM. 

C.1.3 Proof 

We are now in position to prove Theorem 0 Let A = {C{X,X),N{X)) be a p-SCLI 
algorithm over M'^, let M{X) denote its iteration matrix and let /y 4 ,b(x) be some quadratic 
function. According to the previous discussion, there exist m G N and C{A),c{A) > 0 such 
that the following hold: 

1. For any initialization point z^ G we have that (Ez*^)^^ converges to 


z* = {I- EM(A))"^ E [UN{A)h] 


2. For any initialization point z'^ G and for any /i G N 


E 


z'^ -z* 


< CAk'^-^p{M{A)f ||z° - z 


3. There exists r G 


such that for any initialization point z^ G 


— z*,r) 7 ^ 0 and sufficiently large /c G N, 


E 


z^ -z* 


>CAk'^-^p{M{A)f\\z^-z 


ll„o 


(62) 

(63) 

which satisfies 

(64) 


Since iteration complexity is defined over the problem space, we need to derive the same 
inequalities in terms of 


= U^z’^ 


Note that by linearity we have x* = U z*. For bounding (x^)^^ from above we use (63), 


E 


x^-x* 


E 


[/Tzfc _ 


< 

G 

E 

IS 

1 

IS 

* 


< 

G 

CaA:™-V(M)^||z°-z*|| 

= 

G 

CAk'^-^p{M)^\\U^° -Ux.* 

< 

G 

\\U\\CAk^-^ 

p(M)^ -X 


(65) 
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Thus, the same rate as in (63), with a different constant, holds in the problem space. 
Although the corresponding lower bound takes a slightly different form, it proof is done 

such that the corresponding z® is satisfied the condition in 


similarly. Pickx^,x^, 

(64). For sufficiently large A: G N, it holds that 


, X' 


max 

A:=0,...,p—1 


Ex*'+^' - Ex* 



Vp 


Vp 




p-i 

El 

j=o 


xJ 



( 66 ) 


We arrived at the following corollary which states that the asymptotic convergence rate 
of any p-SCLI optimization algorithm is governed by the spectral radius of its iteration 
matrix. 


Theorem 13 Suppose A is a p-SCLI optimization algorithm over Q'^(E) and let M{X) 
denotes its iteration matrix. Then, there exists m G N such that for any quadratic function 


/A,b(x) G Q'^(S) it holds that 


E 



o(^k"^-^p{M{X)V ||x° - x*||) 


where x* denotes the minimizer of f a, b{^)- Furthermore, there exists an initialization point 
xO g such that 


max 

k=0,...,p—l 


Ex^+^' - Ex* 


= n 


' m—1 


Vp 


-piMixV 


I 0 * 

X — X 


Finally, in the next section we prove that the spectral radius of the iteration matrix 
equals the root radius of the determinant of the characteristic of polynomial by showing 
that 


det(A/- M(A:)) = det(£(A,A:)) 


Combining this with the corollary above and by applying Inequality (12) and the like, 
concludes the proof for Theorem 


C.1.4 The Characteristic Polynomial of the Iteration Matrix 

The following lemma provides an explicit expression for the characteristic polynomial of iter¬ 
ation matrices. The proof is carried out by applying elementary determinant manipulation 
rules. 
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Lemma 14 Let M{X) he the matrix defined in J57| j and let A he a given d x d square 
matrix. Then, the characteristic polynomial of KM (A) can he expressed as the following 
matrix polynomial 


X¥.M{A)W — (-l)^'^det 


p-i 


XMd - Y, X'^KCki^A) 


fc =0 


( 67 ) 


Proof As usual, for the sake of readability we omit the functional dependency on A, as 
well as the expectation operator symbol. For A 7^ 0 we get, 


XAt(A) = det(M - XIpd) 

( -\Id h 

-Xh Id 

= det 


-Xld 


= det 


Co 

• . . Cp_2 

Cp_i 

-Xh / 

-\h 

h 




-Xh h 





-Xh 

h 

Od 

Ci + X-^Co ... 

Cp-2 

4 ? 

1 

7 

0" 


= det 


-Xh 

h 



\ 


-Xh 

h 






-Xh 

h 

Od 

Od 

C2 + X-^Ci + X-'^Co .. 

Cp-2 

Cp-i — Xh / 


= det 


-Xh h 

\ 

-Xh h 


-Xh 

h 

0 

0 



dei{-XhY-^ det Y - Xh 


\k=l 
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\k=l 


p-1 


{-ir'^detiXPh-Y.X'^Ck 


k=0 



By continuity we have that the preceding equality also holds A 


0, as well. 


C.2 Proof of Theorem [5] 

We prove that consistent p-SCLI optimization algorithms must satisfy conditions (© and 
(18). The reverse implication is proven by reversing the steps of the following proof. 


First, note that (18) is an immediate consequence of Corollary 13, according to which 


p-SCLIs converge if and on 


y if the the root radius of the characteristic polynomial is strictly 
I, let A = {C{X, X), N{X)) be a consistent p-SCLI optimization 
algorithm over Q‘^(S) and let /A,b(x) £ be a quadratic function. Furthermore, let 


smaller than 1. As for (18 


us denote the corresponding iteration matrix by M{X) as in (57). By Theorem 12 for any 
initialization point we have 


Ez'' 


(I - EM(y4))"^ UE[N{A)]h 


where U is as defined in (56), i.e.. 


u = {0d,...,0d,idV e 

^ V 

p—1 times 

For the sake of readability we omit the functional dependency on A, as well as the expec¬ 


tation operator symbol. Combining this with Equation (58) yields 

(/ - M)-^ UNh = -A-^h 
Since this holds for any b £ we get 

(/ - M)~^ UN = 

Evidently, N is an invertible matrix. Therefore, 


{I -M)~' U = -{NA) 




Now, recall that 


/ Orf Id 

Od 


-1 


( 68 ) 


M = 


V Co 


Od Id 
Cp—2 Cp—1 ) 
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where Cj denote the coefficient matrices. We partition M as follows 


Mil 

Mi 2 

M21 

M22 


/ Orf Id 

\ 

Od Id 


Od 

Id 

\ Co ... Cp.2 

Cp-i / 


The l.h.s of Equation ( 68 ) is in fact the inverse of the Schur Complement of / — Mu in 
I — M, i.e., 


(/ - M 22 - M 2 i(I - Mii)- 1 Mi 2 )-^ = -{NA)-^ 

I-M 22 - M 2 i{I - Mii)-1Mi2 = -NA 

M 22 + M 2 i(/ — Mii)“^Mi 2 = / + N A (69) 


Moreover, it is straightforward to verify that 


(/-Mn)"^ = 


/ h Id Id \ 
Id Id 


\ 


Id / 


Plugging in this into (69) yields 


equivalently, 


p-i 

Y,Ci = I+ NA 

i=0 


C{l,A) = -NA 


Thus concludes the proof. 


(70) 


C.3 Proof of Lemma [6] 

First, we prove the following Lemma. Let us denote 


where r is some non-negative constant. 

Lemma 15 Suppose q{z) is a monic polynomial of degree p with complex coefficients. Then, 






9(^) 


9|,(i)|(^) 
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Proof As the statement is clear, we prove here the only the => part. 

By the fundamental theorem of algebra q{z) has p roots. Let us denote these roots by 
Cl, C 2 , • ■ •, Cp e C . Equivalently, 


p 

(l{z) = 

i=l 


Let us denote r = 1 17 ( 1 ) |. If r > 1 we get 


r = 


11(1 - C< 


2 = 1 
P 




2=1 


2 = 1 


<n(i+i^-ii)=n(i+^-i) 


— L] = r 


2=1 


2=1 


Consequently, Inequality 0 becomes an equality. Therefore, 

I 1 - Ci I = 1 + I Ci I = Vi G [p] 

Now, for any two complex numbers tc, z G C it holds that 

\w + z \ = \w \ + \ z\ Aigiw) = Arg( 2 ;) 


(71) 


(72) 


Using this fact in the first equality of Equation (72), we get that Arg(—Ci) = Arg(l) = 0, 
i.e., Ci are negative real numbers. Writing —Ci in the second equality of Equation (72) 
instead of | Ci |, yields 1 — Ci = concluding this part of the proof. 

The proof for r G [0,1) follows along the same lines, only this time we use the reverse 
triangle inequality. 


2=1 2 = 1 2=1 

P 

= n “ ^))= 'T 
2 = 1 

Note that in the first inequality, we used the fact that r G [0,1) => | Ci | < 1 for all i. ■ 

The proof for Lemma now follows easily. In case q{l) > 0, if q{z) = {z — {1 — 
then, clearly. 


p{q{z)) = p{{z-{l- {fPf)) = I 1 - ^ I 


Otherwise, according to Lemma 15 


p{<liz)) > I 1 - ^1 
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In case (;(1) < 0, we must use the assumtpoin that the coefficients are reals (see Remark 
16), whereby the mere fact that 


lim q{z) = oo 

combined with the Mean-Value theorem implies p{q{z)) > 1. This concludes the proof. 

Remark 16 The requirement that the coefficients of q{z) should be real is inevitable. To 
see why, consider the following polynomial, 

u{z) = — ^1 — 0.5e^ ^ ^ 


Although ri(l) = ^1 — ^1 — l/2e 3 = —1/8 < 0, it holds that p(u(l)) < 1. Indeed, not 

all the coefficients ofu{z) are real. Notice that the claim does hold for degree < 3, regardless 
of the additional assumption on the coefficients ofu{z). 

C.4 Bounding the spectral radius of diagonal inversion matrices from below 
nsing scalar matrices 

We prove a lower bound on the convergence rate of p-SCLI optimization algorithm with 
diagonal inversion matrices. In particular, we show that for any p-SCLI optimization algo¬ 
rithm whose inversion matrix is diagonal there exists a quadratic function for which it does 
not perform better than p-SCLI optimization algorithms with scalar inversion matrix. We 
prove the claim for d = 2. The general case follows by embedding the 2-dimensional case 
as a principal sub-matrix in some higher dimensional matrix in L]). 

Let A = {M{X),N{X)) be a p-SCLI optimization algorithm and assume that N{X) is 
a diagonal matrix. Define the following positive definite matrix 

L—ft \ 

^ dp (73) 

2 2 / 

And note that cr{B) = {p, L}. As usual, we wish to derive a lower bound on p{M{B)). To 
this end, denote 


N = N{B) 


a 0 \ 

0 /3 ) 


where a,/3 G M. By a straightforward calculation we get that the eigenvalues of —NB are 


o'i,2(a,/3) 


— (a + /3){L + p) ^ 

4 

— {a + I3){L + p) ^ 

4 


(a -|- /3)(L -I- p) 


— a(3Lp 




(74) 
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Using similar arguments to the ones which were applied in the scalar case, we get that both 
eigenvalues of —NB must be strictly positive as well as satisfy 


p(M)>minmax| ^ai{a, /3) - I , •^ 0 - 2 ( 0 ;, /3) - I \ 


(75) 


Equation (74) shows that the minimum of the preceding is obtained for u = which 
simplifies to 


max 


1 

^a2{a,(3) - 1 

1 > max 1 

^ai{u,u) - 1 





= max 1 

1 - 11 , 


} 


The rest of the analysis is carried out similarly to the scalar case, resulting in 


p{M{B)) > 


{/Q + l 
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